feat(elasticsearch): optional ingest_pipeline for bulk writes #3252
Open
GunaPalanivel wants to merge 12 commits into deepset-ai:main
Pass Elasticsearch bulk pipeline when set so ingest pipelines (e.g. inference) run at index time. Serialize in to_dict; omit when unset. Extend retriever serialization tests. Fixes deepset-ai#2940 Made-with: Cursor
Add filter edge-case tests (comparisons, ranges, in/not in) and retriever init validation tests. filters.py reaches 100% line coverage; combined package unit coverage ~63%. Made-with: Cursor
@GunaPalanivel, you did not add any integration tests against the ElasticSearch cloud instance. I will do it and can take it over from here.
@davidsbatista - if you are fine with it, I'll add integration tests against the ElasticSearch cloud instance.
No need, thanks, your contribution was already helpful!
What
Adds an optional `ingest_pipeline` parameter to `ElasticsearchDocumentStore`. When set, `write_documents` and `write_documents_async` pass it as the Elasticsearch bulk API `pipeline` argument so ingest pipelines (including inference processors) can run at index time. Default behavior is unchanged (`None`).
Why
Users who index into Elasticsearch with a pre-defined ingest pipeline had no way to attach that pipeline from Haystack. The parent discussion in #699 defers broader retrieval work; this change is the narrow index-time hook only.
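The index-time hook can be pictured with a small, dependency-free sketch. It is not the code from this PR: `build_bulk_kwargs` is a hypothetical helper name, and the only behavior it illustrates is the one described here, namely that the `pipeline` keyword is forwarded to the bulk call only when a pipeline is configured.

```python
from typing import Any, Dict, Optional


def build_bulk_kwargs(index: str, ingest_pipeline: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical helper: assemble keyword arguments for a bulk write call."""
    kwargs: Dict[str, Any] = {"index": index}
    # Add `pipeline` only when one is configured, so stores that never set
    # `ingest_pipeline` send exactly the same bulk request as before.
    if ingest_pipeline is not None:
        kwargs["pipeline"] = ingest_pipeline
    return kwargs


# Without a pipeline the keyword is absent; with one it is forwarded as-is.
print(build_bulk_kwargs("docs"))
print(build_bulk_kwargs("docs", "my-inference-pipeline"))
```

In the store itself, a dictionary like this would be unpacked into `helpers.bulk` or `helpers.async_bulk`; the index and pipeline names above are placeholders.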
How
- `ingest_pipeline` on `ElasticsearchDocumentStore.__init__`: non-empty strings only (after strip); whitespace-only raises `ValueError`.
- `helpers.bulk` / `helpers.async_bulk` receive `pipeline=...` only when the value is set, so deletes and existing callers are unaffected.
- `to_dict` / `from_dict` include the field like `sparse_vector_field` (missing key deserializes to `None`).
Testing
Unit tests cover serialization (`test_to_dict`, `from_dict` variants), validation, and mocked bulk calls to assert `pipeline` is passed or omitted. Retriever `to_dict` / `from_dict` expectations were updated for nested document store serialization.
To reproduce locally:
    cd integrations/elasticsearch
    hatch run test:unit
    hatch run test:types
For formatting on Windows, if `hatch run fmt` fails on path handling, use:
Test output:
Type check:
Trade-offs
No per-call `pipeline` override on `write_documents`; the pipeline is configured at store level only. This can be added later if needed.
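The validation and serialization rules listed under How can be sketched in plain Python. `StoreSketch` and `normalize_ingest_pipeline` are hypothetical names, not the PR's code. Two details are assumptions: that the stored value is the stripped string, and that `to_dict` always emits the key (as `None` when unset).

```python
from typing import Any, Dict, Optional


def normalize_ingest_pipeline(value: Optional[str]) -> Optional[str]:
    """Hypothetical validator: None passes through, blank strings are rejected."""
    if value is None:
        return None
    stripped = value.strip()
    if not stripped:
        raise ValueError("ingest_pipeline must be a non-empty string")
    # Assumption: the stripped form is what gets stored.
    return stripped


class StoreSketch:
    """Stand-in for the document store, reduced to the one new field."""

    def __init__(self, ingest_pipeline: Optional[str] = None) -> None:
        self.ingest_pipeline = normalize_ingest_pipeline(ingest_pipeline)

    def to_dict(self) -> Dict[str, Any]:
        # Assumption: the key is always present, with None when unset.
        return {"init_parameters": {"ingest_pipeline": self.ingest_pipeline}}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "StoreSketch":
        # A missing key deserializes to None, matching the default.
        params = data.get("init_parameters", {})
        return cls(ingest_pipeline=params.get("ingest_pipeline"))


store = StoreSketch(ingest_pipeline="my-inference-pipeline")
restored = StoreSketch.from_dict(store.to_dict())
```

A dictionary serialized before this field existed has no `ingest_pipeline` key, so `from_dict` falls back to `None` and the store behaves as it did previously.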